A Fast Decision Tree Learning Algorithm
Abstract
There is growing interest in scaling up the widely used decision-tree learning algorithms to very large data sets. Although numerous diverse techniques have been proposed, a fast tree-growing algorithm without a substantial decrease in accuracy or a substantial increase in space complexity is essential. In this paper, we present a novel, fast decision-tree learning algorithm that is based on a conditional independence assumption. The new algorithm has a time complexity of O(m · n), where m is the size of the training data and n is the number of attributes. This is a significant asymptotic improvement over the time complexity O(m · n²) of the standard decision-tree learning algorithm C4.5, with an additional space increase of only O(n). Experiments show that our algorithm performs competitively with C4.5 in accuracy on a large number of UCI benchmark data sets, and performs even better and significantly faster than C4.5 on a large number of text classification data sets. The time complexity of our algorithm is as low as that of naive Bayes: it is as fast as naive Bayes, yet it outperforms naive Bayes in accuracy in our experiments. Our algorithm is a core tree-growing algorithm that can be combined with other scaling-up techniques to achieve further speedup.

Introduction and Related Work

Decision-tree learning is one of the most successful learning algorithms, due to its various attractive features: simplicity, comprehensibility, no parameters, and the ability to handle mixed-type data. In decision-tree learning, a decision tree is induced from a set of labeled training instances, each represented by a tuple of attribute values and a class label. Because of the vast search space, decision-tree learning is typically a greedy, top-down, and recursive process that starts with the entire training data and an empty tree. An attribute that best partitions the training data is chosen as the splitting attribute for the root, and the training data are then partitioned into disjoint subsets according to the values of the splitting attribute. For each subset, the algorithm proceeds recursively until all instances in a subset belong to the same class. A typical tree-growing algorithm, such as C4.5 (Quinlan 1993), has a time complexity of O(m · n²), where m is the size of the training data and n is the number of attributes.

Since large data sets with millions of instances and thousands of attributes are not rare today, interest in developing fast decision-tree learning algorithms is growing rapidly. The major reason is that decision-tree learning works comparably well on large data sets, in addition to its generally attractive features. For example, decision-tree learning outperforms naive Bayes on larger data sets, while naive Bayes performs better on smaller data sets (Kohavi 1996; Domingos & Pazzani 1997). A similar observation has been made when comparing decision-tree learning with logistic regression (Perlich, Provost, & Simonoff 2003).

Numerous techniques have been developed to speed up decision-tree learning, such as designing a fast tree-growing algorithm, parallelization, and data partitioning. Among them, a large amount of research work has been done on reducing the computational cost related to accessing secondary storage, such as SLIQ (Mehta, Agrawal, & Rissanen 1996), SPRINT (Shafer, Agrawal, & Mehta 1996), and Rainforest (Gehrke, Ramakrishnan, & Ganti 2000).
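As a reference point for the discussion that follows, here is a minimal sketch of the generic greedy, top-down tree-growing process described above. It is an illustrative reconstruction, not the fast algorithm proposed in this paper; the data representation (each instance as a dict of nominal attribute values paired with a class label) and the pluggable choose_split heuristic are assumptions of the example.

```python
from collections import Counter

def grow_tree(instances, candidate_attributes, choose_split):
    """Greedy top-down tree-growing.

    instances: list of (attribute_value_dict, class_label) pairs.
    candidate_attributes: set of attribute names not yet used on this path.
    choose_split: heuristic that picks the splitting attribute, e.g. information gain.
    """
    labels = Counter(label for _, label in instances)
    # Stop when the subset is pure or no attributes remain; predict the majority class.
    if len(labels) == 1 or not candidate_attributes:
        return labels.most_common(1)[0][0]
    best = choose_split(instances, candidate_attributes)
    children = {}
    # Partition into disjoint subsets, one per value of the splitting attribute, and recurse.
    for value in {inst[best] for inst, _ in instances}:
        subset = [(inst, lab) for inst, lab in instances if inst[best] == value]
        children[value] = grow_tree(subset, candidate_attributes - {best}, choose_split)
    return (best, children)
```

With an information-gain-based choose_split (defined in the next section) plugged in, this corresponds to the standard procedure whose per-node cost the paper analyzes.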
An excellent survey is given in (Provost & Kolluri 1999). Ultimately, however, developing a fast tree-growing algorithm itself is more essential. There are basically two approaches to designing a fast tree-growing algorithm: searching in a restricted model space, and using a powerful search heuristic.

Learning a decision tree from a restricted model space achieves speedup by avoiding a search of the vast unrestricted space. One-level decision trees (Holte 1993) are a simple structure in which only one attribute is used to predict the class variable. A one-level tree can be learned quickly with a time complexity of O(m · n), but its accuracy is often much lower than C4.5's. Auer et al. (1995) present an algorithm T2 for learning two-level decision trees. However, it has been noticed that T2 is no more efficient than even C4.5 (Provost & Kolluri 1999). Learning restricted decision trees often leads to performance degradation in some complex domains.

Using a powerful heuristic to search the unrestricted model space is another realistic approach. Indeed, most standard decision-tree learning algorithms are based on heuristic search. Among them, the decision-tree learning algorithm C4.5 (Quinlan 1993) has been well recognized as the reigning standard. C4.5 adopts information gain as the criterion (heuristic) for splitting-attribute selection and has a time complexity of O(m · n²). Note that the number n of attributes corresponds to the depth of the decision tree, which is an important factor contributing to the computational cost of tree-growing. Although one empirical study suggests that on average the learning time of ID3 is linear in the number of attributes (Shavlik, Mooney, & Towell 1991), it has also been noticed that C4.5 does not scale well when there are many attributes (Dietterich 1997).

The motivation of this paper is to develop a fast algorithm that searches the unrestricted model space with a powerful heuristic that can be computed efficiently. Our work is inspired by naive Bayes, which is based on an unrealistic assumption: all attributes are independent given the class. Because of this assumption, naive Bayes has a very low time complexity of O(m · n), and still performs surprisingly well (Domingos & Pazzani 1997). Interestingly, if we introduce a similar assumption in decision-tree learning, the widely used information-gain heuristic can be computed more efficiently, which leads to a more efficient tree-growing algorithm with the same asymptotic time complexity of O(m · n) as naive Bayes and one-level decision trees. That is the key idea of this paper.

A Fast Tree-Growing Algorithm

To simplify our discussion, we assume that all attributes are non-numeric, and each attribute then occurs exactly once on each path from leaf to root. We specify how to cope with numeric attributes later. In the algorithm analysis in this paper, we assume that both the number of classes and the number of values of each attribute are much less than m and can therefore be treated as constants. We also assume that all training data are loaded into main memory.

Conditional Independence Assumption

In tree-growing, the heuristic plays a critical role in determining both classification performance and computational cost. Most modern decision-tree learning algorithms adopt an (im)purity-based heuristic, which essentially measures the purity of the resulting subsets after applying the splitting attribute to partition the training data. Information gain, defined as follows, is widely used as a standard heuristic.
IG(S, X) = Entropy(S) - \sum_{x} \frac{|S_x|}{|S|} Entropy(S_x),   (1)

where S is a set of training instances, X is an attribute and x is one of its values, S_x is the subset of S consisting of the instances with X = x, and Entropy(S) is defined as

Entropy(S) = - \sum_{i=1}^{|C|} P_S(c_i) \log P_S(c_i),   (2)

where P_S(c_i) is estimated by the percentage of instances belonging to class c_i in S, and |C| is the number of classes. Entropy(S_x) is defined similarly.

Note that tree-growing is a recursive process of partitioning the training data, and S is the training data associated with the current node. Then P_S(c_i) is actually P(c_i | x_p) on the entire training data, where X_p is the set of attributes along the path from the current node to the root, called path attributes, and x_p is an assignment of values to the variables in X_p. Similarly, P_{S_x}(c_i) is P(c_i | x_p, x) on the entire training data.

In the tree-growing process, each candidate attribute (each attribute not in X_p) is examined using Equation 1, and the one with the highest information gain is selected as the splitting attribute. The most time-consuming part of this process is evaluating P(c_i | x_p, x) for computing Entropy(S_x): the algorithm must pass through each instance in S, and for each instance it iterates through each candidate attribute X. This results in a time complexity of O(|S| · n). Note that the union of the subsets on each level of the tree is the entire training data of size m, so the time complexity for each level is O(m · n). Since the depth of the tree can be up to n, the standard decision-tree learning algorithm has a time complexity of O(m · n²).

Our key observation is that we may not need to pass through S for each candidate attribute to estimate P(c_i | x_p, x). According to probability theory (Bayes' rule), we have

P(c_i | x_p, x) = \frac{P(c_i | x_p) \, P(x | x_p, c_i)}{P(x | x_p)}.
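For concreteness, a direct implementation of Equations 1 and 2, usable as the choose_split heuristic in the earlier sketch, might look as follows. This is an illustration of the standard information-gain heuristic, not the faster estimate developed in this paper, and it exhibits exactly the per-node cost discussed above: the node's data are scanned once per candidate attribute.

```python
from collections import Counter
from math import log2

def entropy(instances):
    """Entropy(S) from Equation 2; instances is a list of (attribute_dict, label) pairs."""
    counts = Counter(label for _, label in instances)
    total = len(instances)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(instances, attribute):
    """IG(S, X) from Equation 1 for one candidate attribute X."""
    total = len(instances)
    remainder = 0.0
    # Weighted entropy of the subsets S_x, one per value x of the attribute.
    for value in {inst[attribute] for inst, _ in instances}:
        subset = [(inst, lab) for inst, lab in instances if inst[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(instances) - remainder

def choose_split(instances, candidate_attributes):
    """Pick the candidate attribute with the highest information gain.

    Each call passes through the node's data once per candidate attribute,
    which (with the number of values treated as a constant) is the
    O(|S| * n) per-node cost discussed in the text.
    """
    return max(candidate_attributes, key=lambda a: information_gain(instances, a))
```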